Introduction

Background Information

This dataset has information pulled from the Global Health Observatory (GHO) data repository under the World Health Organization (WHO), with additional country specific information collected by the United Nations (UN). There are 22 variables and 2938 observations in this dataset, with 20 of those variables being predicting variables.

Dataset Citation

Description of Dataset

This dataset describes life expectancy of 193 different countries, along with various other factors that may impact life expectancy. Some of these additional factors include, but are not limited to: mortality rate of adults and infants, alcohol consumption, immunizations of specific diseases, deaths by specific diseases, Gross Domestic Product, and population of the country. All of the data relates back to a specific country.

Dataset Interest

Anyone who has stayed up to date with the news will have likely heard about the recent “anti-vaxers” movement. Essentially, people are starting to believe that vaccinations actually have a negative impact on quality of life and life expectancy, and can also lead to autism. While many of these disadvantages of not being immunized, especially the connection to autism, have been disproven, many still do not get vaccinated. Certainly, everyone is entitled to their own choices and beliefs, so we are hoping that an analysis of this dataset may shed some scientific light on both sides of the arguments, right or wrong. While there have been many datasets about life expectancies, there are not many that accompany immunization statistics with it.

Furthermore, we hope, through studying and modeling the relationship between life expectancy and various factors, to identify some of the biggest factors that influence life expectancy. Through this, we could demonstrate the biggest leverage that an under-developed country could take action on fairly easily to improve the life expectancy of their people.

Methods

Import libraries to use

library(lmtest)
library(faraway)
library(readr)
library(knitr)

Load the data into R

life_expectancy <- read.csv('Life Expectancy Data.csv')

Analyze the plots between life expectancy and the rest of the predictors to do a manual selection of potentially useful predictors

pairs(life_expectancy[, c(4, 1, 2, 3, 5)], col = 'dodgerblue')

pairs(life_expectancy[, c(4, 5, 6, 7, 8)], col = 'dodgerblue')

pairs(life_expectancy[, c(4, 9, 10, 11, 12)], col = 'dodgerblue')

pairs(life_expectancy[, c(4, 13, 14, 15, 16)], col = 'dodgerblue')

pairs(life_expectancy[, c(4, 17, 18, 19)], col = 'dodgerblue')

pairs(life_expectancy[, c(4, 20, 21, 22)], col = 'dodgerblue')

Data Cleaning

## remove rows with NA 
subdata = life_expectancy[complete.cases(life_expectancy), ]
## remove rows with Income.composition.of.resources=0
subdata = subset(subdata,subdata$Income.composition.of.resources!=0)
## remove rows with BMI less than 8
subdata = subset(subdata,subdata$BMI>8)

Below are the plots showing we have discovered data issue where some values appear to be miss coded and miss a digit. We have come up with a method to exclude values that are way too off, compared to other years within the same country.

##View(subdata[1:100,])
par(mfrow=c(1,2))
plot(subdata$Year[subdata$Country=="Swaziland"],subdata$Adult.Mortality[subdata$Country=="Swaziland"],
     col = 'dodgerblue',xlab="Year",ylab="Adult.Mortality",main="Adult.Mortality in Swazilan",cex.main=0.8)
plot(subdata$Year[subdata$Country=="Swaziland"],subdata$Hepatitis.B[subdata$Country=="Swaziland"],
      col = 'dodgerblue',xlab="Year",ylab="Hep.B",main="Hep.B immunization in Swaziland",cex.main=0.8)

Below are the plots showing we have discovered the value of some columns appear to be corrupted beyond repair and we will exclude those columns in our modeling. As an example here are the GDP and population, percentage.expenditure on health data of Australia over the years. The GDP data appears to be broken and the percentage.expenditure on health seems to be related and is also broken.

par(mfrow=c(1,3))
plot(subdata$Year[subdata$Country=="Australia"],subdata$GDP[subdata$Country=="Australia"],
     col = 'dodgerblue',xlab="Year",ylab="GDP",main="GDP in Australia",cex.main=0.8)
plot(subdata$Year[subdata$Country=="Australia"],subdata$Population[subdata$Country=="Australia"],
      col = 'dodgerblue',xlab="Year",ylab="Population",main="Population in Australia",cex.main=0.8)
plot(subdata$Year[subdata$Country=="Australia"],subdata$percentage.expenditure[subdata$Country=="Australia"],
      col = 'dodgerblue',xlab="Year",ylab="percentage.expenditure",main="percentage.expenditure in Australia",cex.main=0.8)

We attempt to exclude extreme values for Adult.Mortality

## Number of reords before the clean up
nrow(subdata)
## [1] 1443
## create list of unique countries
country_unique=unique(subdata$Country)
country_length=length(country_unique)


## Create data frame to store the average adult mortality data for each country
country_avg_adult_mort=data.frame(country=as.character(rep("",country_length)),
                                  avg_adult_mort=rep(0,country_length))
## turn country names into character for easier comparison
country_avg_adult_mort$country=as.character(country_avg_adult_mort$country)

## for each iteration we calculate avg adult mortality for each country and exclude abnormal records
## abnormal records = < 20%*average Adult.Mortality of that country or >5 * average Adult.Mortality of that country
for (i in 1:country_length) {
  country_avg_adult_mort$country[i]=country_unique[i]
  country_avg_adult_mort$avg_adult_mort[i]=mean(subdata$Adult.Mortality[ subdata$Country == country_unique[i] ])
  
  ## each iteration exclude identified abnormal records
  subdata=subdata[ !(subdata$Country == country_unique[i] & (
    subdata$Adult.Mortality < (country_avg_adult_mort$avg_adult_mort[i]/5)
    | subdata$Adult.Mortality > (country_avg_adult_mort$avg_adult_mort[i]*5))),]
}

## number of records post clean up
nrow(subdata)
## [1] 1234

We attempt to exclude extreme values for Polio

## Number of reords before the clean up
nrow(subdata)
## [1] 1234
## create list of unique countries
country_unique=unique(subdata$Country)
country_length=length(country_unique)


## Create data frame to store the average Polio (Pol3) immunization coverage data for each country
country_avg_polio=data.frame(country=as.character(rep("",country_length)),
                                  avg_polio=rep(0,country_length))
## turn country names into character for easier comparison
country_avg_polio$country=as.character(country_avg_polio$country)

## for each iteration we calculate avg Value for each country and exclude abnormal records
## abnormal records = < 20%*average of that country or >5 * average of that country
for (i in 1:country_length) {
  country_avg_polio$country[i]=country_unique[i]
  country_avg_polio$avg_polio[i]=mean(subdata$Polio[ subdata$Country == country_unique[i] ])
  
  ## each iteration exclude identified abnormal records
  subdata=subdata[ !(subdata$Country == country_unique[i] & (
    subdata$Polio < (country_avg_polio$avg_polio[i]/5)
    | subdata$Polio > (country_avg_polio$avg_polio[i]*5))),]
}

## number of records post clean up
nrow(subdata)
## [1] 1159

We attempt to exclude extreme values for Diphtheria

## Number of reords before the clean up
nrow(subdata)
## [1] 1159
## create list of unique countries
country_unique=unique(subdata$Country)
country_length=length(country_unique)


## Create data frame to store the average Diphtheria immunization coverage data for each country
country_avg_Diphtheria=data.frame(country=as.character(rep("",country_length)),
                                  avg_Diphtheria=rep(0,country_length))
## turn country names into character for easier comparison
country_avg_Diphtheria$country=as.character(country_avg_Diphtheria$country)

## for each iteration we calculate avg Value for each country and exclude abnormal records
## abnormal records = < 20%*average of that country or >5 * average of that country
for (i in 1:country_length) {
  country_avg_Diphtheria$country[i]=country_unique[i]
  country_avg_Diphtheria$avg_Diphtheria[i]=mean(subdata$Diphtheria[ subdata$Country == country_unique[i] ])
  
  ## each iteration exclude identified abnormal records
  subdata=subdata[ !(subdata$Country == country_unique[i] & (
    subdata$Diphtheria < (country_avg_Diphtheria$avg_Diphtheria[i]/5)
    | subdata$Diphtheria > (country_avg_Diphtheria$avg_Diphtheria[i]*5))),]
}

## number of records post clean up
nrow(subdata)
## [1] 1121

We attempt to exclude extreme values for Hepatitis.B

## Number of reords before the clean up
nrow(subdata)
## [1] 1121
## create list of unique countries
country_unique=unique(subdata$Country)
country_length=length(country_unique)


## Create data frame to store the average Diphtheria immunization coverage data for each country
country_avg_Hepatitis_B=data.frame(country=as.character(rep("",country_length)),
                                  avg_Hepatitis_B=rep(0,country_length))
## turn country names into character for easier comparison
country_avg_Hepatitis_B$country=as.character(country_avg_Hepatitis_B$country)

## for each iteration we calculate avg Value for each country and exclude abnormal records
## abnormal records = < 20%*average of that country or >5 * average of that country
for (i in 1:country_length) {
  country_avg_Hepatitis_B$country[i]=country_unique[i]
  country_avg_Hepatitis_B$avg_Hepatitis_B[i]=mean(subdata$Hepatitis.B[ subdata$Country == country_unique[i] ])
  
  ## each iteration exclude identified abnormal records
  subdata=subdata[ !(subdata$Country == country_unique[i] & (
    subdata$Hepatitis.B < (country_avg_Hepatitis_B$avg_Hepatitis_B[i]/5)
    | subdata$Hepatitis.B > (country_avg_Hepatitis_B$avg_Hepatitis_B[i]*5))),]
}

## number of records post clean up
nrow(subdata)
## [1] 1085

Create a detaset that contains columns that we believe have predicting power of life expectancy and exclude columns that has unreasonable and unrepairable data. We exclude GDP, percentage.expenditure and population due to data issue.

## full data excluding year, percentage.expenditure, GDP and Population
life_exp_full_data = subdata[,c(-1,-8,-17,-18)]
## Selected columns
life_exp_clean_data = subdata[,c(2, 3, 4, 5, 7, 11, 13, 15, 16, 21, 22)]

## Selected potential predictor to perform exhaustive search for models that perform well and pass assumption tests. ## Therefore not including schooling because it is corelated with Income.composition.of.resources
## Also excluded Diphtharia due to colinearity with Polio
life_exp_search_data = subdata[,c(2, 3, 4, 5, 7, 11, 13, 16, 21)]

pairs(subdata[,c(4,13,15)])

Fit an additive linear regression model for all available predictors

le_full_add_model <- lm(Life.expectancy ~ ., data = life_exp_full_data)

Fit an additive linear regression model for all predictors we manually analyzed as useful

le_add_model <- lm(Life.expectancy ~ ., data = life_exp_clean_data)

Fit an interaction linear regression model up to all 2 way interactions for all available predictors

le_inter_model <- lm(Life.expectancy ~ . * ., data = life_exp_clean_data)

Model Diagnostics

Leverage

Identify high leverage data points

full_add_lev <- hatvalues(le_full_add_model) > 2 * mean(hatvalues(le_full_add_model))
add_lev <- hatvalues(le_add_model) > 2 * mean(hatvalues(le_add_model))
inter_lev <- hatvalues(le_inter_model) > 2 * mean(hatvalues(le_inter_model))

Influential

Identify influential data points

full_add_inf <- cooks.distance(le_full_add_model) > 4 / length(cooks.distance(le_full_add_model))
add_inf <- cooks.distance(le_add_model) > 4 / length(cooks.distance(le_add_model))
inter_inf <- cooks.distance(le_inter_model) > 4 / length(cooks.distance(le_inter_model))

Data Removal & Variable Selection

Refit models without high leverage or influential data poitns

le_full_add_model <- lm(Life.expectancy ~ ., data = life_exp_full_data, subset = !full_add_lev | !full_add_inf)
le_add_model <- lm(Life.expectancy ~ ., data = life_exp_clean_data, subset = !add_lev | !add_inf)
le_inter_model <- lm(Life.expectancy ~ . * ., data = life_exp_clean_data, subset = !add_lev | !add_inf)

Perform backwards AIC to figure out which predictors are useful

le_full_add_model_sel <- step(le_full_add_model, trace = 0)
le_add_model_sel <- step(le_add_model, trace = 0)
le_inter_model_sel <- step(le_inter_model, trace = 0)

Normality

Check normality assumption through a QQ-Plot for each of the models. The top 3 are the selected models, while the bottom 3 are the base models we started with.

par(mfrow= c(2, 3))

qqnorm(resid(le_full_add_model_sel),main="qqnorm for le_full_add_model_sel",cex.main=0.8)
qqline(resid(le_full_add_model_sel))

qqnorm(resid(le_add_model_sel),main="qqnorm for le_add_model_sel",cex.main=0.8)
qqline(resid(le_add_model_sel))

qqnorm(resid(le_inter_model_sel),main="qqnorm for le_inter_model_sel",cex.main=0.8)
qqline(resid(le_inter_model_sel))

qqnorm(resid(le_full_add_model),main="qqnorm for le_full_add_model",cex.main=0.8)
qqline(resid(le_full_add_model))

qqnorm(resid(le_add_model),main="qqnorm for le_add_model",cex.main=0.8)
qqline(resid(le_add_model))

qqnorm(resid(le_inter_model),main="qqnorm for le_inter_model",cex.main=0.8)
qqline(resid(le_inter_model))

Equal Variance

Check equal variance assumption for the full additive model

par(mfrow = c(2, 3))

plot(fitted(le_full_add_model), resid(le_full_add_model))
plot(fitted(le_add_model), resid(le_add_model))
plot(fitted(le_inter_model), resid(le_inter_model))

hist(resid(le_full_add_model))
hist(resid(le_add_model))
hist(resid(le_inter_model))

Model Building Through Assumptions

High level summary of assumption violations and P-Values of initialy fitted models

report_frame=data.frame(
  Model_Name=c("Full_add","Full_manual_add","Full_int","AIC_add","AIC_manual_add","AIC_int"),
  R_squared=c(summary(le_full_add_model)$r.squared,
              summary(le_add_model)$r.squared,
              summary(le_inter_model)$r.squared,
              summary(le_full_add_model_sel)$r.squared,
              summary(le_add_model_sel)$r.squared,
              summary(le_inter_model_sel)$r.squared),
  BP_test_pvalue=c(bptest(le_full_add_model)$p.value,
              bptest(le_add_model)$p.value,
              bptest(le_inter_model)$p.value,
              bptest(le_full_add_model_sel)$p.value,
              bptest(le_add_model_sel)$p.value,
              bptest(le_inter_model_sel)$p.value),
  shapiro_test=c(shapiro.test(resid(le_full_add_model))$p.value,
              shapiro.test(resid(le_add_model))$p.value,
              shapiro.test(resid(le_inter_model))$p.value,
              shapiro.test(resid(le_full_add_model_sel))$p.value,
              shapiro.test(resid(le_add_model_sel))$p.value,
              shapiro.test(resid(le_inter_model_sel))$p.value)
)
kable(report_frame)
Model_Name R_squared BP_test_pvalue shapiro_test
Full_add 0.9200431 0.0000003 0
Full_manual_add 0.9157945 0.0000014 0
Full_int 0.9344324 0.0009214 0
AIC_add 0.9198806 0.0001289 0
AIC_manual_add 0.9157214 0.0000247 0
AIC_int 0.9338789 0.0010885 0

We attempt to search for a model from all possible combinations of selected additive predictors that has a fair R-Squared value and passes both assumption tests.

## prepare a list of input formulas that captures all permutation and combination of predictors from the columns that we selected. 
call_list=rep("",250)
predictor_list=names(life_exp_search_data)
predictor_list=predictor_list[-3]
counter=1
for (i in 1:length(predictor_list)){
  permutation_predictors=t(combn(predictor_list,i))
  for (j in 1:nrow(permutation_predictors)){
    new_set="Life.expectancy ~ "
    for (k in 1:length(permutation_predictors[j,])){
      new_set=paste(new_set," ",permutation_predictors[j,k],"+")
    }
    call_list[counter]=substring(new_set,1,nchar(new_set)-1)
    counter=counter+1
  }
}

## create a result data frame that contains model r.sqaures, bptest pvalue and shapiro test pvalue
result_frame=data.frame(model=as.character(rep("",length(call_list))),
                        r_sq=rep(0,length(call_list)),
                        bp_pval=rep(0,length(call_list)),
                        shapiro_pval=rep(0,length(call_list)))
result_frame$model=as.character(result_frame$model)

for (i in 1:length(call_list)){
  mod=lm(call_list[i],data=life_exp_search_data)
  result_frame$model[i]=call_list[i]
  result_frame$r_sq[i]=summary(mod)$r.squared
  result_frame$bp_pval[i]=bptest(mod)$p.value
  result_frame$shapiro_pval[i]=shapiro.test(resid(mod))$p.value
}

Pick a model from the few models that pass the BP test.

par(mfrow = c(1, 2))
hist(result_frame$bp_pval)
hist(result_frame$shapiro_pval)

kable(result_frame[result_frame$bp_pval>0.05, ])
model r_sq bp_pval shapiro_pval
1 Life.expectancy ~ Year 0.0016263 0.5889239 0
4 Life.expectancy ~ Alcohol 0.1572469 0.2533965 0
11 Life.expectancy ~ Year + Alcohol 0.1674815 0.2678070 0
22 Life.expectancy ~ Adult.Mortality + Alcohol 0.8087177 0.0699786 0
43 Life.expectancy ~ Year + Adult.Mortality + Alcohol 0.8098278 0.1093357 0
197 Life.expectancy ~ Year + BMI + Polio + HIV.AIDS + Income.composition.of.resources 0.8346374 0.0577006 0
233 Life.expectancy ~ Year + Status + BMI + Polio + HIV.AIDS + Income.composition.of.resources 0.8351157 0.0773988 0

We choose the model with predictor Year + Status + BMI + Polio + HIV.AIDS + Income.composition.of.resources with 0.835 r_square value and barely passing the BP test as our 7th candidate model

bp_search_mod = lm(Life.expectancy ~ Year + Status + BMI + Polio + HIV.AIDS + Income.composition.of.resources,data=life_exp_search_data)

Collinearity

Calculate the variance inflation factor for the predictors in each model

vif_full_add <- vif(le_full_add_model)
vif_add <- vif(le_add_model)
vif_inter <- vif(le_inter_model)
vif_full_sel <- vif(le_full_add_model_sel)
vif_add_sel <- vif(le_add_model_sel)
vif_inter_sel <- vif(le_inter_model_sel)
vif_bp <- vif(bp_search_mod)

Model Selection

Perform an ANOVA test to see which model is preferred between the different models, with an \(\alpha = .05\)

anova(le_add_model_sel, le_add_model)[2, 'Pr(>F)'] < .05
## [1] FALSE
anova(le_full_add_model_sel, le_full_add_model)[2, 'Pr(>F)'] < .05
## [1] FALSE
anova(le_inter_model_sel, le_inter_model)[2, 'Pr(>F)'] < .05
## [1] FALSE
anova(le_add_model_sel, le_inter_model_sel)[2, 'Pr(>F)'] < .05
## [1] TRUE

Results

Overall we fitted and selected 7 candidate models and the below is a summary of how effective do they make predictions and their assumption test scores.

report_frame=data.frame(
  Model_Name=c("Full_add","Full_manual_add","Full_int","AIC_add","AIC_manual_add","AIC_int","BP_Search"),
  R_squared=c(summary(le_full_add_model)$r.squared,
              summary(le_add_model)$r.squared,
              summary(le_inter_model)$r.squared,
              summary(le_full_add_model_sel)$r.squared,
              summary(le_add_model_sel)$r.squared,
              summary(le_inter_model_sel)$r.squared,
              summary(bp_search_mod)$r.squared),
  BP_test_pvalue=c(bptest(le_full_add_model)$p.value,
              bptest(le_add_model)$p.value,
              bptest(le_inter_model)$p.value,
              bptest(le_full_add_model_sel)$p.value,
              bptest(le_add_model_sel)$p.value,
              bptest(le_inter_model_sel)$p.value,
              bptest(bp_search_mod)$p.value),
  shapiro_test=c(shapiro.test(resid(le_full_add_model))$p.value,
              shapiro.test(resid(le_add_model))$p.value,
              shapiro.test(resid(le_inter_model))$p.value,
              shapiro.test(resid(le_full_add_model_sel))$p.value,
              shapiro.test(resid(le_add_model_sel))$p.value,
              shapiro.test(resid(le_inter_model_sel))$p.value,
              shapiro.test(resid(bp_search_mod))$p.value),
  VIF_greater_5=c(sum(vif_full_add>5),
                  sum(vif_add>5),
                  sum(vif_inter>5),
                  sum(vif_full_sel>5),
                  sum(vif_add_sel>5),
                  sum(vif_inter_sel>5),
                  sum(vif_bp>5))
)
kable(report_frame)
Model_Name R_squared BP_test_pvalue shapiro_test VIF_greater_5
Full_add 0.9200431 0.0000003 0 10
Full_manual_add 0.9157945 0.0000014 0 4
Full_int 0.9344324 0.0009214 0 46
AIC_add 0.9198806 0.0001289 0 4
AIC_manual_add 0.9157214 0.0000247 0 2
AIC_int 0.9338789 0.0010885 0 35
BP_Search 0.8351157 0.0773988 0 0

Despite the model selected through exhaustive assumption test search passes BP test and show very little variable inflation. But the prediction accruacy is in general lacking behind the other models. We would like to compare the rest of the model through anova test

AIC_manual_add Vs Full_manual_add

AIC_add Vs Full_add

AIC_int Vs Full_int

With three candidates, we firstly proceed to eliminate the interactive model chosen through backward AIC. Despite its superior prediction accuracy, it is very complex and has a large amount of collinearity. Both work agaisnt our goal of analyzing the relationship between various factors and life expectancy. We proceed to compare AIC_manual_add and AIC_add

summary(le_full_add_model_sel)
## 
## Call:
## lm(formula = Life.expectancy ~ Year + Status + Adult.Mortality + 
##     infant.deaths + Hepatitis.B + under.five.deaths + Total.expenditure + 
##     Diphtheria + thinness.5.9.years + Income.composition.of.resources + 
##     Schooling, data = life_exp_full_data, subset = !full_add_lev | 
##     !full_add_inf)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.4215  -1.4364  -0.1576   1.0798   8.6023 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      7.0438133 38.4006353   0.183 0.854496    
## Year                             0.0270911  0.0191377   1.416 0.157193    
## StatusDeveloping                -0.7656835  0.2472296  -3.097 0.002007 ** 
## Adult.Mortality                 -0.0424938  0.0009921 -42.832  < 2e-16 ***
## infant.deaths                    0.0437644  0.0096916   4.516 7.03e-06 ***
## Hepatitis.B                     -0.0257740  0.0071264  -3.617 0.000313 ***
## under.five.deaths               -0.0347829  0.0072374  -4.806 1.76e-06 ***
## Total.expenditure                0.0971491  0.0344895   2.817 0.004942 ** 
## Diphtheria                       0.0486858  0.0106325   4.579 5.23e-06 ***
## thinness.5.9.years              -0.0686795  0.0215760  -3.183 0.001500 ** 
## Income.composition.of.resources 26.3454090  1.6731220  15.746  < 2e-16 ***
## Schooling                       -0.2040486  0.0733882  -2.780 0.005527 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.428 on 1047 degrees of freedom
## Multiple R-squared:  0.9199, Adjusted R-squared:  0.919 
## F-statistic:  1093 on 11 and 1047 DF,  p-value: < 2.2e-16
summary(le_add_model_sel)
## 
## Call:
## lm(formula = Life.expectancy ~ Status + Adult.Mortality + BMI + 
##     Diphtheria + Income.composition.of.resources + Schooling, 
##     data = life_exp_clean_data, subset = !add_lev | !add_inf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.219  -1.433  -0.132   1.100   9.764 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     59.820401   0.899330  66.517  < 2e-16 ***
## StatusDeveloping                -0.687014   0.245656  -2.797 0.005257 ** 
## Adult.Mortality                 -0.043859   0.001000 -43.850  < 2e-16 ***
## BMI                              0.028282   0.007516   3.763 0.000177 ***
## Diphtheria                       0.041949   0.007801   5.377  9.3e-08 ***
## Income.composition.of.resources 24.023562   1.749558  13.731  < 2e-16 ***
## Schooling                       -0.146572   0.072283  -2.028 0.042837 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.478 on 1058 degrees of freedom
## Multiple R-squared:  0.9157, Adjusted R-squared:  0.9152 
## F-statistic:  1916 on 6 and 1058 DF,  p-value: < 2.2e-16

Judging by the VIF in the summary table, we know that AIC_add model has more variables that are correlated and the fact that AIC_add model has almost double the amount of predictors than AIC_manual_add moedel. We would like to take a closer look at AIC_manual_add model and see if the coefficients make sense

Status: (coef -0.68701357) The model shows with everything held equal, developing countries have 0.687014 shorter life expectancy then developed countries. This makes a lot of sense.

Adult.Mortality: (coef -0.04385859) The model shows that the higher the amount of people who die between 15 and 60 years per 1000 population, the lower the life expectancy of that country. Makes sense.

BMI: (coef 0.02828238) The model shows the higher Average Body Mass Index of entire population the higher the life expectancy of that country. Makes sense since BMI in general represents the nutrition level of the whole population.

Diphtheria: (coef 0.04194903) The model shows the higher Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%), the higher the life expectancy of that country. Makes sense; this directly supports our goal of providing evidence that immunization helps overall life expectancy. We also found a high correlation between Diphtheria and Polio which means this positive relation between Diphtheria and life expectancy also applies to Polio immunization and immunization in general.

pairs(subdata[,c(4,13,15)])

Income.composition.of.resources: (coef 24.02356163) The model shows the higher Human Development Index in terms of income composition of resources, the higher the life expectancy of that country. Makes sense since richer countries and more developed countries have better health care and thus healthier populations.

Schooling: (coef -0.14657182) The coefficient of Number of years of Schooling (years) does not make intuitive sense, but knowing that Schooling is correlated with Income.composition.of.resources does. So the combination these factors should provide accurate predictions.

pairs(subdata[,c(4,21,22)])

We will check the model assumption of the final selected model

par(mfrow = c(1, 2))
qqnorm(resid(le_add_model_sel),main="QQ Norm plot for AIC_manual_add",cex.main=0.8)
qqline(resid(le_add_model_sel))
plot(fitted(le_add_model), resid(le_add_model),main="QQ Norm plot for AIC_manual_add",cex.main=0.8)
abline(h = 0, col = "darkorange", lwd = 2)

Despite that the model failed the Shapiro test, but given the fact that we searched through all combination of predictors and failed to find any additive model that could pass the Shapiro test, we believe this issue could be due to the given data set is not able to cover all variance. Furthermore, the life expectancy of a country is an immensely complex topic and the number of deciding factors could be hundreds

The fitted vs residual plot does not look as bad as the QQ norm plot. Despite the model failed BP test, we don’t see a clear pattern in the fitted vs residual graph.

Ultimately, we decided to choose the additive model that was selected through backward AIC from the poll of predictors that we manually selected.

Discussion

The first thing we did was perform data cleaning on our dataset. Upon intial inspection, we realized that there was a lot of data that had NA values, as well as some outliers. So, we cleaned up the source data by excluding records with NA value in the columns of interest. Also, we excluded records with unreasonable records. For example, Income.composition.of.resources = 0 does not seem reasonable comparing to the overall distribution. Additionally, upon further investigating the data, we found several data issues with some columns where data varies greatly for the same country year over year. In many cases, it appears to be input error (i.e. missing a digit). We try to eliminate records that are out of ordinary by comparing to the average of the country for some columns. But there are some columns that are so messy that we deem beyond repair, therefore are excluded from our analysis. We also created plots to help aid in this process, all of which can be seen in the methods section.

Once we cleaned up the data amongst various columns and rows, we created a pairs plot for every predictor as it related to life expectancy, the response we chose to use for this project. Looking at the pairs plot, we selected a series of predictors that we felt were useful to the dataset, but also predictors that we thought would be interesting to analyze. Then, we decided to fit 3 models: an additive model with all available predictors, an additive model with the predictors we manually selected, and a 2 way interaction model with the manually selected predictors, as well. We chose to do this because we knew the dataset had too many predictors to be able to accurately fit anything beyond an additive model. So, we included the full additive model for comparison, but were ultimately focused on the predictors that we manually selected.

Next, we did some diagnostics of the various models we were using. First, we identified high leverage and influential points in the dataset, according to the standard procedure outlined in the textbook. Once we identified these data points, we removed them from the dataset and refit the models without them. After this, we performed backwards AIC selection on the models to remove some of the predictors that were not useful. Now we have 6 models, the original 3 models we created, as well as their selected couterparts. But are these models good, LINE according models? To figure this out, we performed various diagnostics. We plotted a QQ plot for each model to test their normality assumptions. What we found was that the graphs suggested a violation of assumptions, but as we know, the graphs can be misleading and not always clear cut to read. So, we also performed the Shapiro-Wilks test, and found that they indeed were suspect for the normality assumption. Furthermore, we used a fitted vs residual plot, a histogram of the residuals, and a BP test on each model to test the equal variance assumption. The fitted vs residual, while not perfect, looked pretty good overall. The data, for the most part, was centered around 0, though there was certainly outlier data. Also, the data seemed to be more highly concentrated to the right side of the graph. The histogram of the residuals, upon analysis, did seem to follow a normal distribution. The BP tests, however, all failed to pass, causing suspicion of the equal variance assumption. Despite the high R-Squared value overall, all the models failed the BP test and Shapiro test which indiate that the homoscedasticity assumption and normality assumption are violated.

In an effort to find a better model, we attempt to search for a model from all possible combinations of selected additive predictors that has a fair R-Squared value and passes both assumption tests. We searched through 255 combinations of models and there were only a handful of models that pass the BP test; none seemed to pass the Shapiro test. So we proceed to pick one from the few models that pass the BP test. We choose the model with predictors Year + Status + BMI + Polio + HIV.AIDS + Income.composition.of.resources with 0.835 R-Squared value and barely passing the BP test as our 7th candidate model.

Next, we analyzed the collinearity of each model and performed an ANOVA test on each model with its selected counterpart. Through the ANOVA test, the two way interaction model was selected. However, through analyzing the collinearity of each model’s predictors, as well as taking into consideration our goal of analyzing immunization as it relates to life expectancy, we chose the manually selected, additive model derived from the backwards AIC process. We felt that this model would be better for predicting life expectancy as it relates to immunization as opposed to the ANOVA selected interaction model.

Appendix